Statistical parsing for German: modeling syntactic properties and annotation differences

Author

  • Amit Dubey
Abstract

Statistical parsing research can be described as Anglo-centric: new models are first proposed for English parsing and only then tested on other languages. Indeed, a standard approach to parsing with new treebanks is to adapt fully developed English parsing models to the other language. In this dissertation, however, we claim that many assumptions of English parsing do not generalize to other languages and treebanks because of linguistic and annotation differences. For example, we show that lexicalized models originally proposed for English parsing generalize poorly to German. Even after modifying the models to account for annotation differences, we find the benefit of lexicalization to be far smaller than in English. With this as a starting point, we take a closer look at the effect that linguistic differences between English and German have on statistical parsing results. We find that a number of linguistic elements of German play a more crucial role than lexicalization. For example, adding a relatively simple model of the German case system to the parser accounts for more ambiguity than a complex model including lexicalization. Further studies show that lexical category ambiguity accounts for a surprising number of parsing mistakes, and while a model of morphology we develop gives mixed results, an error analysis suggests that a correct model of morphology would help resolve harmful and common verb/adjective ambiguities. In addition, we offer a preliminary model of long-distance dependencies and show that it helps greatly in resolving ambiguities caused by German free word order constructions. We also find that the choice of evaluation metric can have a profound impact on parsing performance: it appears that lexicalized models perform better on dependency-based metrics, whereas unlexicalized models perform better on labelled bracketing metrics.
Other seemingly arbitrary choices also affect parsing results: the choice of search and smoothing algorithms can potentially obscure helpful linguistic disambiguation cues. The best-performing model we develop sets a new state of the art on the NEGRA and TIGER corpora, with labelled bracketing scores of 76.2 on NEGRA and 79.5 on TIGER. Furthermore, this parser scores 84.0 on dependencies on the NEGRA corpus, also the best reported performance on that corpus, and 86.2 on the TIGER corpus.
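
The abstract contrasts labelled bracketing metrics with dependency-based metrics. As a rough illustration of why the two can rank parsers differently, the sketch below computes both on a toy sentence. It assumes the common PARSEVAL-style definition of labelled bracketing (precision, recall and F1 over `(label, start, end)` spans) and a simple unlabelled head-attachment accuracy; the exact metric definitions used in the dissertation are not given here, and the function names, the toy sentence and its analyses are hypothetical.

```python
from collections import Counter


def labelled_bracket_scores(gold, predicted):
    """Return (precision, recall, f1) over labelled brackets.

    Each bracket is a (label, start, end) tuple. Brackets are compared
    as multisets so that repeated identical brackets are not collapsed.
    """
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection: a bracket matches at most as often as it
    # occurs in both analyses.
    matched = sum((gold_counts & pred_counts).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def head_attachment_accuracy(gold_heads, predicted_heads):
    """Fraction of tokens whose predicted head index equals the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("analyses must cover the same tokens")
    if not gold_heads:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)


# Toy sentence (hypothetical analyses): "Der Mann sieht den Hund"
# Token spans are half-open [start, end); heads are 1-based, 0 = root.
gold_brackets = [("S", 0, 5), ("NP", 0, 2), ("NP", 3, 5)]
pred_brackets = [("S", 0, 5), ("NP", 0, 2), ("PP", 3, 5)]  # wrong label

gold_heads = [2, 3, 0, 5, 3]
pred_heads = [2, 3, 0, 3, 3]  # "den" attached to the verb

bracket_p, bracket_r, bracket_f1 = labelled_bracket_scores(
    gold_brackets, pred_brackets
)
dependency_acc = head_attachment_accuracy(gold_heads, pred_heads)
```

In this toy case a single mislabelled constituent costs one third of the brackets, while a single wrong attachment costs one fifth of the dependencies, which hints at how the same parser output can look better or worse depending on the metric.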

Related resources

Robust Syntactic Annotation of Corpora and Memory-based Parsing

This talk provides an overview of current work in my research group on the syntactic annotation of the Tübingen corpus of spoken German and of the German Reference Corpus (Deutsches Referenzkorpus: DEREKO) of written texts. Morpho-syntactic and syntactic annotation as well as annotation of function-argument structure for these corpora is performed automatically by a hybrid architecture that com...

Revisiting the Impact of Different Annotation Schemes on PCFG Parsing: A Grammatical Dependency Evaluation

Recent parsing research has started addressing the questions a) how parsers trained on different syntactic resources differ in their performance and b) how to conduct a meaningful evaluation of the parsing results across such a range of syntactic representations. Two German treebanks, Negra and TüBa-D/Z, constitute an interesting testing ground for such research given that the two treebanks mak...

Phrase-Boundary Translation Model Using Shallow Syntactic Labels

Phrase-boundary model for statistical machine translation labels the rules with classes of boundary words on the target side phrases of training corpus. In this paper, we extend the phrase-boundary model using shallow syntactic labels including POS tags and chunk labels. With the priority of chunk labels, the proposed model names non-terminals with shallow syntactic labels on the boundaries of ...

Morphological and Syntactic Case in Statistical Dependency Parsing

Most morphologically rich languages with free word order use case systems to mark the grammatical function of nominal elements, especially for the core argument functions of a verb. The standard pipeline approach in syntactic dependency parsing assumes a complete disambiguation of morphological (case) information prior to automatic syntactic analysis. Parsing experiments on Czech, German, and H...

On Detecting Errors in Dependency Treebanks

Dependency relations between words are increasingly recognized as an important level of linguistic representation that is close to the data and at the same time to the semantic functor-argument structure as a target of syntactic analysis and processing. Correspondingly, dependency structures play an important role in parser evaluation and for the training and evaluation of tools based on depend...

Publication date: 2005